Abstract

Sparse coding can learn representations that are robust to
noise and can model higher-order structure for image
classification. However, its inference algorithm is
computationally expensive, even when supervised signals are
used to learn compact and discriminative dictionaries. A
simplified neural network module (SNNM) has been proposed to
learn discriminative dictionaries directly and thereby avoid
the expensive inference, but the SNNM ignores sparse
representations. We therefore propose a sparse SNNM module by
adding a mixed-norm regularization ($\ell_1/\ell_2$ norm). The
sparse SNNM modules are further stacked to build a sparse
deep stacking network (S-DSN). In the experiments, we
evaluate S-DSN on four databases: Extended YaleB, AR,
15 scene and Caltech101. Experimental results show that our
model outperforms related classification methods while using
only a linear classifier. Notably, we reach 98.8% recognition
accuracy on 15 scene.


Introduction 

It is well known that sparse representations have a number
of theoretical and practical advantages in computer vision
and machine learning (Lee et al. 2007; Gregor and LeCun
2010; Yang et al. 2012). In particular, sparse coding
techniques have led to promising results in image
classification, e.g. face recognition and digit
classification. Sparse coding, as a generative model, is an
important way to extract sparse representations. However,
sparse coding has an expensive inference algorithm and does
not use the labels of the training data. Although some
researchers use supervised signals to learn compact and
discriminative dictionaries (Jiang, Lin, and Davis 2013;
Zhuang et al. 2013; Huang et al. 2013), the expensive
inference algorithm remains a problem. Since the dictionaries
are trained with the labels anyway, can we learn the
discriminative dictionaries directly and avoid the expensive
inference?

Fortunately, a simplified neural network module (SNNM)
(Deng and Yu 2011a) can train discriminative dictionaries
directly and compute the representations quickly. In the
SNNM, the input layer is non-linearly mapped to a hidden
layer by a projection matrix W and a sigmoid activation
function, and the hidden layer is linearly mapped to an
output layer by a matrix U. W has discriminative ability
because it is trained by minimizing the least squares error
between the output vector and the label vector. Moreover,
SNNM infers the hidden representation quickly, requiring
only a projection multiplication and a nonlinear
transformation. Following a stacking scheme (Wolpert 1992),
many SNNM modules are stacked to build a Deep Stacking
Network (DSN), previously named the Deep Convex Network
(Deng and Yu 2011b). Recently, DSN has received increasing
attention due to its successful application in speech
classification and information retrieval (Deng, Yu, and
Platt 2012; Deng, He, and Gao 2013). Additionally, the DSN
is attractive in that the batch-mode nature of the SNNM
offers a potential solution to the otherwise difficult
problem of scalability when dealing with the virtually
unlimited amount of training data available nowadays (Deng
and Yu 2013). Therefore, we extend DSN to image
classification.

Despite DSN's success in speech classification, its
framework has several limitations. First, the conventional
DSN uses only the sigmoid activation function for the
nonlinear hidden layer (Deng, Yu, and Platt 2012). Although
the sigmoid has been widely used in the literature, it
suffers from a number of drawbacks: for example, training
can be slow, and with random initialization the solution can
get stuck at a poor local solution that does not have good
predictive performance (Glorot and Bengio 2010). There are
two other common types of activation function. One is the
hyperbolic tangent, which has been applied to training deep
neural networks and suffers from the same problems as the
sigmoid. A more recent proposal is the rectified linear unit
(ReLU) (Nair and Hinton 2010). It has been observed that
ReLU is very useful for object recognition and often trains
significantly faster (Glorot, Bordes, and Bengio 2011).

Second, sparse representations play a key role in image
classification because they can learn features that are
robust to noise, train Gabor-like filters, and model
higher-order features (Ranzato et al. 2007; Lee, Ekanadham,
and Ng 2008). Sparse representations have led to promising
results in image classification (Jiang, Lin, and Davis
2013). Furthermore, there is considerable evidence that in
the brain the percentage of active neurons is between 1 and
4% (Lennie 2003). It is therefore reasonable to consider
sparse representations in SNNM modules. However, the
conventional techniques for training SNNM completely ignore
sparse representations. In neural networks, sparsity is
generally achieved by penalizing non-zero activation of
hidden units (Ranzato, Boureau, and LeCun 2008) or a
deviation from the expected activation of the hidden units
(Lee, Ekanadham, and Ng 2008). Moreover, local dependencies
between hidden units can help them model the observed data
better (Luo et al. 2011). An SNNM module, which has no
connections within the hidden layer, cannot exhibit these
dependencies. Fortunately, the hidden units can be divided
into non-overlapping groups to capture the local
dependencies among them without adding connections (Luo et
al. 2011). The local dependencies can be implemented by
applying $\ell_1/\ell_2$ regularization to the activation
probabilities of the hidden units in the SNNM module.

In light of the above arguments, this paper develops a
Sparse Deep Stacking Network (S-DSN) for image
classification. S-DSN is obtained by stacking sparse SNNM
modules, which consider two activation functions, ReLU and
sigmoid, and use group sparse penalties ($\ell_1/\ell_2$
regularization) to penalize the hidden representations in
each SNNM module. S-DSN has several advantages. First,
compared with a sparse coding technique such as LC-KSVD
(Jiang, Lin, and Davis 2013), a one-layer S-DSN learns
projection dictionaries, which leads to faster inference.
Second, compared with DSN, S-DSN can extract sparse
representations that yield good features for image
classification. Last, S-DSN retains the scalable structure
of DSN. To confirm these advantages, extensive experiments
have been performed on four databases: Extended YaleB, AR,
15 scene and Caltech101. The experiments show that our model
obtains better classification results than the benchmark
methods we compare against. In particular, we reach 98.8%
recognition accuracy on 15 scene.


Deep Stacking Network 

The DSN architecture was originally presented in (Deng and
Yu 2011b). Deng and Yu explore an original strategy for
building deep networks, based on stacking layers of basic
SNNM modules, each of which takes the simplified form of a
multilayer perceptron. We describe it mathematically as
follows.

Let the target vectors $t_i = [t_{1i}, \cdots, t_{ji},
\cdots, t_{Ci}]^T$ be arranged to form the columns of
$T = [t_1, \cdots, t_i, \cdots, t_N] \in \mathbb{R}^{C \times N}$.
Let the input data vectors $x_i = [x_{1i}, \cdots, x_{ji},
\cdots, x_{Di}]^T$ be arranged to form the columns of
$X = [x_1, \cdots, x_i, \cdots, x_N] \in \mathbb{R}^{D \times N}$.
Formally, in the basic module, the lower-layer weight
matrix, denoted by $W \in \mathbb{R}^{D \times L}$, connects
the linear input layer and the nonlinear hidden layer. The
upper-layer weight matrix, denoted by
$U \in \mathbb{R}^{L \times C}$, connects the nonlinear
hidden layer with the linear output layer. The output of the
upper layer is $Y = U^T H$, where
$H = \sigma(W^T X) \in \mathbb{R}^{L \times N}$ is the hidden
layer output and $\sigma(a) = 1/(1 + e^{-a})$ is the sigmoid
activation function (Deng and Yu 2011a; Deng and Yu 2011b).
The parameters U and W are learned to minimize the least
squares objective:

$$\min_{U,W} f_{dsn} = \|U^T H - T\|_F^2 + \alpha \|U\|_F^2, \tag{1}$$

where $\alpha$ is a regularization parameter for the
upper-layer weight matrix U.

U has a closed-form solution:

$$U = (H H^T + \alpha I)^{-1} H T^T. \tag{2}$$
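The forward pass and the closed-form solution in (2) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation; the toy sizes, the random data and the function names `snnm_forward` and `solve_upper_weights` are assumptions introduced here.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def snnm_forward(X, W):
    """Hidden representation H = sigma(W^T X), shape (L, N)."""
    return sigmoid(W.T @ X)

def solve_upper_weights(H, T, alpha):
    """Closed-form U = (H H^T + alpha I)^{-1} H T^T, shape (L, C)."""
    L = H.shape[0]
    # Solving the linear system avoids forming the explicit inverse.
    return np.linalg.solve(H @ H.T + alpha * np.eye(L), H @ T.T)

# Toy sizes (hypothetical): D inputs, L hidden units, C classes, N examples.
rng = np.random.default_rng(0)
D, L, C, N = 6, 4, 3, 20
X = rng.standard_normal((D, N))
T = np.eye(C)[:, rng.integers(0, C, size=N)]  # one-hot label columns
W = 0.1 * rng.standard_normal((D, L))

H = snnm_forward(X, W)
U = solve_upper_weights(H, T, alpha=0.5)
Y = U.T @ H  # module output, shape (C, N)
```

Because U is obtained from a single linear solve, inference for a new example costs only one projection, one nonlinearity and one further matrix product.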

Using a gradient descent algorithm (Deng and Yu 2011b) to
minimize the least squares objective in (1), and deriving
the gradient with respect to W in the basic module, we obtain

$$\frac{\partial f_{dsn}}{\partial W} = 2X\left[H^T \circ (\mathbf{1} - H^T) \circ (U U^T H - U T)^T\right], \tag{3}$$

where $\circ$ denotes element-wise multiplication and
$\mathbf{1}$ is the matrix of all ones.
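As a sanity check, the gradient in (3) can be compared against central finite differences of the data term with U held fixed. The sketch below assumes toy dimensions and random data; `dsn_loss`, `dsn_grad_W` and `numerical_grad` are illustrative names, not part of the original work.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dsn_loss(W, U, X, T):
    """Data term ||U^T H - T||_F^2 with H = sigma(W^T X); U is held fixed."""
    H = sigmoid(W.T @ X)
    return np.sum((U.T @ H - T) ** 2)

def dsn_grad_W(W, U, X, T):
    """Eq. (3): 2 X [H^T o (1 - H^T) o (U U^T H - U T)^T]."""
    H = sigmoid(W.T @ X)
    return 2.0 * X @ (H.T * (1.0 - H.T) * (U @ U.T @ H - U @ T).T)

def numerical_grad(f, W, h=1e-6):
    """Central finite differences, one entry of W at a time."""
    G = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += h
        Wm[idx] -= h
        G[idx] = (f(Wp) - f(Wm)) / (2.0 * h)
    return G

# Toy problem (hypothetical sizes and random data).
rng = np.random.default_rng(1)
D, L, C, N = 5, 3, 2, 12
X = rng.standard_normal((D, N))
T = rng.standard_normal((C, N))
W = rng.standard_normal((D, L))
U = rng.standard_normal((L, C))

analytic = dsn_grad_W(W, U, X, T)
numeric = numerical_grad(lambda V: dsn_loss(V, U, X, T), W)
max_abs_diff = np.max(np.abs(analytic - numeric))
```

The regularizer $\alpha\|U\|_F^2$ does not depend on W when U is fixed, which is why only the data term appears in the check.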

The term "convex" accentuates the role of convex
optimization in learning the output network weights U in
each basic module (Deng and Yu 2011a). Many basic modules
are often stacked one on top of another to form a deep
model. More specifically, the input units of a higher module
can include the output units of the lower modules and,
optionally, the raw input features (Deng and Yu 2011b). To
obtain higher-order information in the data, DSN has
recently been extended to the Tensor-DSN (T-DSN)
(Hutchinson, Deng, and Yu 2013), whose basic module replaces
the linear map from the hidden representation to the output
with a bilinear mapping. It retains the scalable structure
of DSN and provides the higher-order feature interactions
missing in DSN.


Sparse Deep Stacking Network 

The S-DSN is a sparse variant of the DSN. The stacking
operation of the S-DSN is the same as that of the DSN
described in (Deng and Yu 2011b): the output vector of the
lower module and the original input vector are concatenated
to form the expanded "input" vector for the higher module.
The modular architecture of S-DSN, however, differs from
that of DSN: we consider both the sigmoid and the ReLU
activation functions, and sparse penalties are imposed on
the hidden units of each module.


Sparse Module 

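The output of the upper layer is $Y = U^T H$ and the hidden
layer output is as follows:

$$H = \phi(W^T X) \in \mathbb{R}^{L \times N}, \tag{4}$$

where $\phi(a)$ is the sigmoid activation function
$\sigma(a)$ or the ReLU activation function $\max(0, a)$.
For simplicity, let $\mathcal{R} = \{1, 2, \cdots, L\}$
denote the set of all hidden units. The units in
$\mathcal{R}$ are divided into G groups, where G is the
number of groups. The g-th group is denoted by
$\mathcal{G}_g$, where
$\mathcal{R} = \bigcup_{g=1}^{G} \mathcal{G}_g$ and
$\bigcap_{g=1}^{G} \mathcal{G}_g = \emptyset$. So,
$H = [H_{\mathcal{G}_1,:}; \cdots; H_{\mathcal{G}_g,:}; \cdots; H_{\mathcal{G}_G,:}]$.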











































The parameters U and W are learned to minimize the least
squares objective:

$$\min_{U,W} f_{sdsn} = \|U^T H - T\|_F^2 + \alpha \|U\|_F^2 + \beta \Psi(H), \tag{5}$$

where $\alpha$ is a regularization parameter for the
upper-layer weight matrix U, $\beta$ is a regularization
constant for the activation of the hidden units, and
$\Psi(H)$ is the penalty imposed on the sparse
representations H. Typically, the $\ell_1$ norm is used as a
penalty to explicitly enforce sparsity on each
representation:

$$\Psi(H) = \sum_{i=1}^{N} \|h_i\|_1, \tag{6}$$

where $h_i$ is the representation of the i-th example
($i = 1, \cdots, N$).

In neural networks, sparse representations are advantageous
for classification (Ranzato et al. 2007). Moreover, group
sparse representations (Bengio et al. 2009) can capture the
statistical dependencies between hidden units and lead to
better performance (Luo et al. 2011). To implement the
dependencies, we divide the hidden units evenly into
non-overlapping groups, which restricts the dependencies to
within these groups and forces the hidden units in a group
to compete with each other (Luo et al. 2011). A mixed-norm
regularization ($\ell_1/\ell_2$-norm) can be applied in the
modular architecture to achieve group sparse
representations. Following (Luo et al. 2011), we consider
the mixed-norm regularization

$$\Psi(H) = \sum_{g=1}^{G} \|H_{\mathcal{G}_g,:}\|_{1,2}, \tag{7}$$

where $H_{\mathcal{G}_g,:}$ is the representation matrix
associated with the hidden units belonging to the g-th
group, and the $\ell_1/\ell_2$-norm is defined as

$$\|H_{\mathcal{G}_g,:}\|_{1,2} = \sum_{i=1}^{N} \sqrt{\sum_{l \in \mathcal{G}_g} H_{l,i}^2}. \tag{8}$$
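The two penalties can be computed directly from H. The following sketch assumes that the L hidden units are split into G equal groups of consecutive rows; the paper only states that the units are divided evenly, so this particular assignment and the function names are assumptions.

```python
import numpy as np

def l1_penalty(H):
    """Eq. (6): sum of the l1 norms of the columns of H."""
    return np.sum(np.abs(H))

def group_l1l2_penalty(H, G):
    """Eqs. (7)-(8): l2 norm inside each group, summed over groups and examples."""
    L, N = H.shape
    if L % G != 0:
        raise ValueError("L must be divisible by G for equal-sized groups")
    # Assumption: consecutive rows of H form one group.
    grouped = H.reshape(G, L // G, N)
    return np.sum(np.sqrt(np.sum(grouped ** 2, axis=1)))

# Hand-checkable example: 4 hidden units, 2 examples, 2 groups of 2 units.
H = np.array([[3.0, 0.0],
              [4.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
l1_value = l1_penalty(H)                 # 3 + 4 + 1
group_value = group_l1l2_penalty(H, 2)   # sqrt(3^2 + 4^2) + 1
```

In this example the group penalty (6.0) is smaller than the $\ell_1$ penalty (8.0) because the two active units of the first example share a group, so only the group as a whole is charged.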


Learning Weights Algorithm

Once the lower-layer weight matrix W is fixed, H is also
determined uniquely. Solving for the upper-layer weight
matrix U can then be formulated as a convex optimization
problem:

$$\min_{U} f_{sdsn} = \|U^T H - T\|_F^2 + \alpha \|U\|_F^2, \tag{9}$$

which has the closed-form solution

$$U = (H H^T + \alpha I)^{-1} H T^T. \tag{10}$$

There are two algorithms for learning the lower-layer
weight matrix W. First, given the current U held fixed, W
can be optimized using a gradient descent algorithm (Deng
and Yu 2011a) to minimize the squared error objective

$$\min_{W} f^1_{sdsn} = \|U^T H - T\|_F^2 + \beta \Psi(H), \tag{11}$$


Algorithm 1 Training Algorithm of Sparse Module
1: Input: Data matrix X, label matrix T, parameters
   $\theta = \{\epsilon, \alpha, \beta, G\}$ and training epochs E.
2: Initialize: Projection matrix W is initialized with small
   random values.
3: Given W, calculate H by Eq. (4).
4: Update W by Eq. (16).
5: Repeat steps 3-4 for E epochs (or until convergence).
6: Output: weight matrix W.


and deriving the gradient, we obtain

$$\frac{\partial f^1_{sdsn}}{\partial W} = 2X\left[d\phi(H^T) \circ (U U^T H - U T)^T\right] + 2\beta X\left[d\phi(H^T) \circ H^T \circ/ \tilde{H}^T\right], \tag{12}$$

where $\circ$ denotes element-wise multiplication, $\circ/$
denotes element-wise division, $\tilde{H}$ is the matrix
whose elements are
$\tilde{H}_{l,i} = \sqrt{\sum_{l' \in \mathcal{G}_g} H_{l',i}^2}$
for $l \in \mathcal{G}_g$, $d\phi(H^T)$ denotes element-wise
gradient computation, and $d\phi(a)$ is the gradient of the
activation function. When $\phi(a)$ is the sigmoid
activation function, $d\phi(a) = \sigma(a)(1 - \sigma(a))$,
and when $\phi(a)$ is the ReLU activation function,
$d\phi(a)$ is

$$d\phi(a) = \begin{cases} 1, & a > 0; \\ 0, & a \le 0. \end{cases} \tag{13}$$

For the ReLU activation function, we follow the hypothesis
(Glorot, Bordes, and Bengio 2011) that the hard
non-linearities do not hurt the optimization as long as the
gradient can be propagated to many hidden units.
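The element-wise building blocks of (12) are short to write down. The sketch below is illustrative only: it assumes groups of consecutive hidden units, expresses both derivatives in terms of the hidden output H, and adds a small constant `eps` (an assumption, not in the paper) to guard against division by zero when a whole group is inactive, which ReLU can produce.

```python
import numpy as np

def dphi_sigmoid(H):
    """Sigmoid derivative written in terms of the output H = sigma(a)."""
    return H * (1.0 - H)

def dphi_relu(H):
    """Eq. (13), using that H = max(0, a) is positive exactly when a > 0."""
    return (H > 0).astype(H.dtype)

def group_norms(H, G, eps=1e-12):
    """H-tilde: each entry holds the l2 norm of its own group, per example."""
    L, N = H.shape
    grouped = H.reshape(G, L // G, N)  # assumption: consecutive rows per group
    norms = np.sqrt(np.sum(grouped ** 2, axis=1, keepdims=True))  # (G, 1, N)
    return np.broadcast_to(norms, grouped.shape).reshape(L, N) + eps

def penalty_grad_H(H, G):
    """Element-wise ratio H o/ H-tilde, the derivative of Eq. (7) w.r.t. H."""
    return H / group_norms(H, G)

H = np.array([[3.0, 0.0],
              [4.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
ratio = penalty_grad_H(H, 2)  # first column: 3/5 and 4/5 in group 1, zeros in group 2
```

The ratio shrinks active units in proportion to their share of the group norm, which is what makes units within a group compete.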

Second, to move W faster towards a direction that finds the
optimal points, the deterministic nonlinear relationship
between U and W is used to compute the gradient. By plugging
(10) into criterion (5), the least squares objective is
rewritten as:

$$\min_{W} f^2_{sdsn} = \left\|\left[(H H^T + \alpha I)^{-1} H T^T\right]^T H - T\right\|_F^2 + \alpha \left\|(H H^T + \alpha I)^{-1} H T^T\right\|_F^2 + \beta \Psi(H). \tag{14}$$

However, when regularization is used in the objective
function (14) (i.e. $\alpha > 0$), the gradient of
$f^2_{sdsn}$ can be very complicated. To simplify the
gradient we assume $\alpha = 0$ in (14), so that the second
term of (14) vanishes. Similarly to (Deng and Yu 2011b), we
then derive the gradient and obtain

$$\frac{\partial f^2_{sdsn}}{\partial W} = 2X\left[d\phi(H^T) \circ \left[H^\dagger (H T^T)(T H^\dagger) - T^T (T H^\dagger)\right]\right] + 2\beta X\left[d\phi(H^T) \circ H^T \circ/ \tilde{H}^T\right], \tag{15}$$

where $H^\dagger = H^T (H H^T)^{-1}$, and $d\phi(\cdot)$ and
$\tilde{H}$ are defined in (12).
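The first term of (15) can be checked numerically in the simplified setting $\alpha = 0$, $\beta = 0$ with sigmoid hidden units. The sketch below substitutes $U = (HH^T)^{-1}HT^T$ into the loss and compares the analytic gradient with central finite differences; the sizes, data and function names are assumptions made for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def coupled_loss(W, X, T):
    """||U^T H - T||_F^2 with U = (H H^T)^{-1} H T^T substituted (alpha = 0)."""
    H = sigmoid(W.T @ X)
    U = np.linalg.solve(H @ H.T, H @ T.T)
    return np.sum((U.T @ H - T) ** 2)

def coupled_grad_W(W, X, T):
    """First term of Eq. (15) for the sigmoid, with beta = 0."""
    H = sigmoid(W.T @ X)
    H_dag = H.T @ np.linalg.inv(H @ H.T)       # H^dagger = H^T (H H^T)^{-1}, (N, L)
    TH = T @ H_dag                             # (C, L)
    inner = H_dag @ (H @ T.T) @ TH - T.T @ TH  # (N, L)
    return 2.0 * X @ (H.T * (1.0 - H.T) * inner)

def numerical_grad(f, W, h=1e-6):
    """Central finite differences, one entry of W at a time."""
    G = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += h
        Wm[idx] -= h
        G[idx] = (f(Wp) - f(Wm)) / (2.0 * h)
    return G

# Toy problem (hypothetical sizes; N > L so that H H^T is invertible).
rng = np.random.default_rng(2)
D, L, C, N = 5, 3, 2, 12
X = rng.standard_normal((D, N))
T = rng.standard_normal((C, N))
W = rng.standard_normal((D, L))

analytic = coupled_grad_W(W, X, T)
numeric = numerical_grad(lambda V: coupled_loss(V, X, T), W)
```

Unlike the gradient in (12), this one accounts for the fact that U changes whenever W changes, which is the sense in which it uses the deterministic relationship between the two.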

The algorithm then updates W using the gradient defined in
(12) and (15) as

$$W = W - \epsilon \frac{\partial f^1_{sdsn}}{\partial W} \quad \text{or} \quad W = W - \epsilon \frac{\partial f^2_{sdsn}}{\partial W}, \tag{16}$$









































Algorithm 2 Training Algorithm of S-DSN
1: Input: Data X, label T, parameters
   $\theta = \{\epsilon, \alpha, \beta, G\}$, training epochs E and
   the number of layers K.
2: Initialize: $X^1 = X$ and $k = 1$.
3: while $k \le K$
4:   Given $X^k$, T, $\theta$ and E, optimize $W^k$ by Algorithm 1.
5:   Given $X^k$ and $W^k$, calculate $H^k$ by Eq. (4), $U^k$ by
     Eq. (10) and $Y^k = (U^k)^T H^k$.
6:   $X^{k+1} = [X; Y^k]$ and $k = k + 1$.
7: end while
8: Output: weight matrices $W^k$ ($k = 1, \cdots, K$).


where $\epsilon$ is a learning rate. The weight matrix
learning process is outlined in Algorithm 1.

The S-DSN Architecture 

The sparse SNNM module described in the previous subsection
is used to construct the K-layer S-DSN architecture, where K
is the number of layers. In the k-th sparse module, we
denote the input by $X^k$, the hidden representations by
$H^k$, the output by $Y^k$, the label matrix by T, and the
weight matrices by $W^k$ and $U^k$. Given input data X and
labels T, $X^1 = X$ when $k = 1$. The general paradigm of
S-DSN can then be decomposed into three phases:

• Step 1: Train the k-th sparse module to minimize the least
squares error between $Y^k$ and T.

• Step 2: Generate the input $X^{k+1}$ of the (k+1)-th sparse
module by appending the output $Y^k$ of the k-th sparse
module.

• Step 3: Iterate Step 1 and Step 2 to construct the S-DSN
architecture.

We summarize the optimization of S-DSN in Algorithm 2. To
capture sparse representations from raw data, the S-DSN
penalizes the hidden unit activations and, with ReLU,
rectifies the negative outputs of the hidden units. Due to
the simple structure of each module, the S-DSN retains the
computational advantages of the DSN in parallelism and
scalability when learning all parameters.
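The three steps above can be sketched as a short training loop. This is a minimal sketch under stated assumptions, not the authors' code: it uses ReLU hidden units, groups of consecutive units, the gradient of Eq. (12) as printed (including its $2\beta$ factor), and it recomputes U by Eq. (10) in every epoch, a step that Algorithm 1 does not list explicitly. All function names and the toy data are hypothetical.

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def solve_U(H, T, alpha):
    """Eq. (10): ridge solution for the upper-layer weights."""
    return np.linalg.solve(H @ H.T + alpha * np.eye(H.shape[0]), H @ T.T)

def group_norms(H, G, eps=1e-12):
    """H-tilde: per-example l2 norm of the group each hidden unit belongs to."""
    L, N = H.shape
    g = H.reshape(G, L // G, N)  # assumption: consecutive rows form a group
    n = np.sqrt(np.sum(g ** 2, axis=1, keepdims=True))
    return np.broadcast_to(n, g.shape).reshape(L, N) + eps

def train_module(X, T, L, G, lr, alpha, beta, epochs, rng):
    """Algorithm 1 with the gradient of Eq. (12) and ReLU hidden units."""
    W = 0.01 * rng.standard_normal((X.shape[0], L))
    for _ in range(epochs):
        H = relu(W.T @ X)
        U = solve_U(H, T, alpha)              # assumption: refresh U each epoch
        dphi = (H > 0).astype(float)          # Eq. (13)
        data = dphi.T * (U @ U.T @ H - U @ T).T
        sparse = dphi.T * H.T / group_norms(H, G).T
        W -= lr * (2.0 * X @ data + 2.0 * beta * X @ sparse)  # Eq. (16)
    return W

def train_sdsn(X, T, K, **kw):
    """Algorithm 2: stack K modules, feeding [X; Y^k] to module k + 1."""
    Xk, modules = X, []
    for _ in range(K):
        W = train_module(Xk, T, **kw)
        H = relu(W.T @ Xk)
        U = solve_U(H, T, kw["alpha"])
        modules.append((W, U))
        Xk = np.vstack([X, U.T @ H])          # raw input plus module output
    return modules

# Toy usage with random data (hypothetical sizes and hyperparameters).
rng = np.random.default_rng(0)
D, C, N = 8, 3, 40
X = rng.standard_normal((D, N))
T = np.eye(C)[:, rng.integers(0, C, size=N)]
modules = train_sdsn(X, T, K=2, L=6, G=3, lr=1e-3, alpha=0.5,
                     beta=1e-3, epochs=5, rng=rng)
```

Each module only sees the raw input and the previous module's output, so the input dimension grows from D to D + C after the first module and stays there.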

Experiments 

We present experimental results on four databases: the
Extended YaleB database, the AR face database, Caltech101
and 15 scene categories.

• Extended YaleB database: this database contains 2,414
frontal face images of 38 people, with about 64 images per
person. The original images were cropped and normalized to
192 x 168 pixels.

• AR database: this database consists of over 4,000 color
images of 126 people. Each person has 26 face images taken
during two sessions. These images include more facial
variations, including different illumination conditions,
different expressions, and different facial "disguises"
(sunglasses and scarves). Following the standard evaluation
procedure, we use a subset of the database consisting of
2,600 images from 50 male subjects and 50 female subjects.
Each face image was also cropped and normalized to
165 x 120 pixels.

• Caltech101: this database [10] contains 9,144 images
belonging to 101 classes, with about 40 to 800 images per
class. Most images in Caltech101 are of medium resolution,
i.e., about 300 x 300.

• 15 scene: this data set, compiled by several researchers
[11,20,24], contains a total of 4,485 images falling into 15
categories, with the number of images per category ranging
from 200 to 400. The categories include living room,
bedroom, kitchen, highway, mountain, etc.


Following (Jiang, Lin, and Davis 2013), the four databases
are preprocessed. In the Extended YaleB database and the AR
face database, each face image is projected onto an
n-dimensional feature vector with a randomly generated
matrix drawn from a zero-mean normal distribution. The
dimension of a random-face feature is 504 for Extended YaleB
and 540 for AR. In the face databases, the n-dimensional
features of each image are normalized to [-1, 1]. For the
Caltech101 database, we first extract SIFT descriptors from
16 x 16 patches, which are densely sampled from each image
on a grid with a step size of 6 pixels; then we extract the
spatial pyramid feature based on the extracted SIFT features
with three grids of size 1 x 1, 2 x 2 and 4 x 4. To train
the codebook for the spatial pyramid, we use standard
k-means clustering with k = 1024. For the 15 scene category
database, we compute the spatial pyramid feature using a
four-level spatial pyramid and a SIFT-descriptor codebook
with a size of 200. Finally, the spatial pyramid features
are reduced to 3000 dimensions by PCA.
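The random-face step can be illustrated as follows. This is a sketch under assumptions: the text does not say how the features are normalized to [-1, 1], so dividing each feature vector by its largest absolute value is a choice made here, and the small synthetic images stand in for the real 192 x 168 crops to keep the example light.

```python
import numpy as np

def random_face_features(images, n, rng):
    """Project flattened images (one per column) to n dims, scale columns to [-1, 1]."""
    P = rng.standard_normal((n, images.shape[0]))  # zero-mean Gaussian projection
    F = P @ images
    # Assumption: per-image scaling by the largest absolute feature value.
    scale = np.max(np.abs(F), axis=0, keepdims=True)
    return F / np.where(scale > 0, scale, 1.0)

rng = np.random.default_rng(0)
images = rng.random((48 * 42, 4))  # 4 synthetic stand-ins for flattened face crops
features = random_face_features(images, n=504, rng=rng)
```

The same projection matrix must be reused for training and test images so that both live in the same feature space.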


The weight matrices are initialized with small random
values sampled from a normal distribution with zero mean and
standard deviation 0.01. For simplicity, we use a constant
learning rate $\epsilon$ chosen from
{20, 15, 5, 2, 1, 0.2, 0.1, 0.05, 0.01, 0.001}, the
regularization parameter $\alpha$ chosen from {1, 0.5, 0.1},
the sparse regularization parameter $\beta$ chosen from
{0.1, 0.05, 0.01, 0.001, 0.0001} and the group number G
chosen from {2, 4, 5, 10, 20}. In all experiments, we train
for only 5 epochs, the number of hidden units is 500 and the
number of layers is 2. For each data set, each experiment is
repeated 10 times with randomly selected training and
testing images, and the average precision is reported. In
the rest of this paper, S-DSN(sigm) denotes S-DSN with the
sigmoid function and S-DSN(relu) denotes S-DSN with the ReLU
function; DSN-1, S-DSN(sigm)-1 and S-DSN(relu)-1 denote
one-layer DSN, S-DSN(sigm) and S-DSN(relu), respectively;
DSN-2, S-DSN(sigm)-2 and S-DSN(relu)-2 denote the
corresponding two-layer models.













Table 1: Hoyer's sparseness measures (HSM) on the Extended
YaleB and AR databases. We train on 15 (10) samples per
category for Extended YaleB (AR) and use the rest for
testing. For both databases, the number of hidden units is
500, the group size for S-DSN(sigm) and S-DSN(relu) is 4 and
the number of layers is 2. On Extended YaleB,
$\epsilon = 0.2$ and $\alpha = 0.5$ are used for DSN;
$\epsilon = 0.2$, $\alpha = 0.5$ and $\beta = 0.001$ are used
for S-DSN(sigm). On AR, $\epsilon = 0.05$ and $\alpha = 1$
are used for DSN; $\epsilon = 0.05$, $\alpha = 1$ and
$\beta = 0.0001$ are used for S-DSN(relu).



Database         Layers   S-DSN(sigm)          DSN
                          HSM     Acc. (%)     HSM     Acc. (%)
Extended YaleB   1        0.096   91.4         0.010   88.9
                 2        0.111   92.0         0.012   89.4

Database         Layers   S-DSN(relu)          DSN
                          HSM     Acc. (%)     HSM     Acc. (%)
AR               1        0.286   93.2         0.003   80.2
                 2        0.306   93.5         0.003   81.2


Table 2: Recognition Results Using Random-Face Features
on the Extended YaleB Database

Methods         Acc. (%)   Methods         Acc. (%)
SRC             97.2       LC-KSVD         96.7
DSN-1           96.6       DSN-2           96.9
S-DSN(sigm)-1   96.9       S-DSN(sigm)-2   97.4
S-DSN(relu)-1   96.1       S-DSN(relu)-2   96.7


Sparseness Comparisons 

Before presenting classification results, we first show the
sparseness of S-DSN(sigm) and S-DSN(relu) compared to DSN.
We use Hoyer's sparseness measure (HSM) (Hoyer 2004) to
quantify how sparse the representations learned by
S-DSN(sigm), S-DSN(relu) and DSN are. This measure lies in
the interval [0, 1] and is on a normalized scale; the closer
its value is to 1, the more zero components the vector has.
We perform comparisons on the Extended YaleB and AR
databases, and the results are reported in Table 1, which
compares the HSM of S-DSN(sigm) and S-DSN(relu) with that of
DSN. S-DSN(sigm) and S-DSN(relu) have both higher sparseness
and higher recognition accuracy. The average sparseness over
two layers is about 0.105 for S-DSN(sigm) (Extended YaleB)
and about 0.291 for S-DSN(relu) (AR). In contrast, the
average sparseness of the two-layer DSN is below 0.02 on
both databases. The S-DSN thus learns sparser
representations. Due to space reasons, Figure 1 only
visualizes the activation probabilities of the first hidden
layer, which are computed under S-DSN(relu) and DSN given an
image from the test set of AR.
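For reference, Hoyer's measure for a nonzero vector x with n components is $(\sqrt{n} - \|x\|_1/\|x\|_2)/(\sqrt{n} - 1)$. The sketch below computes it for a single vector; how the measure is aggregated over examples and layers for Table 1 is not stated in the text, so that part is left out. The function name is illustrative.

```python
import numpy as np

def hoyer_sparseness(x):
    """(sqrt(n) - ||x||_1 / ||x||_2) / (sqrt(n) - 1) for a nonzero vector x."""
    x = np.asarray(x, dtype=float).ravel()
    n = x.size
    l1 = np.sum(np.abs(x))
    l2 = np.sqrt(np.sum(x ** 2))
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1.0)

one_hot = hoyer_sparseness([0.0, 0.0, 5.0, 0.0])  # single active unit
uniform = hoyer_sparseness([1.0, 1.0, 1.0, 1.0])  # all units equally active
```

The two extremes bracket the scale: a vector with one active unit scores 1, and a vector whose components are all equal scores 0.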


Results 

Face Recognition. Extended YaleB: We randomly select half
(32) of the images per category for training and the other
half for testing. The parameters are selected as follows: in
DSN

1 The preprocessed databases can be downloaded from:

http://www.umiacs.umd.edu/~zhuolin/projectlcksvd.html



[Figure 1 panels: DSN and S-DSN(relu); horizontal axis:
hidden units, 0 to 500]

Figure 1: Activation probabilities of the first hidden
layer computed under DSN and S-DSN(relu) on the AR database.
Activation probabilities are normalized by dividing by the
maximum activation probability.


Table 3: Recognition Results Using Random-Face Features
on the AR Face Database

Methods         Acc. (%)   Methods         Acc. (%)
SRC             97.5       LC-KSVD         97.8
DSN-1           97.6       DSN-2           97.8
S-DSN(sigm)-1   97.9       S-DSN(sigm)-2   98.1
S-DSN(relu)-1   97.6       S-DSN(relu)-2   97.8


$\epsilon = 0.1$ and $\alpha = 0.5$; in S-DSN(sigm)
$\epsilon = 0.1$, $\alpha = 0.5$, $G = 2$, and
$\beta = 0.01$; in S-DSN(relu) $\epsilon = 0.01$,
$\alpha = 2$, $G = 5$, and $\beta = 0.001$. AR: For each
person, we randomly select 20 images for training and the
other 6 for testing. In our experiments, $\epsilon = 0.1$
and $\alpha = 0.5$ are used in DSN; $\epsilon = 0.1$,
$\alpha = 0.5$, $G = 4$, and $\beta = 0.001$ are used in
S-DSN(sigm); $\epsilon = 0.01$, $\alpha = 1$, $G = 4$, and
$\beta = 0.001$ are used in S-DSN(relu).

We compare S-DSN with DSN (Deng and Yu 2011b), LC-KSVD
(Jiang, Lin, and Davis 2013) and SRC (Wright et al. 2009),
which reported state-of-the-art results on these two
databases. The experimental results are summarized in Table
2 and Table 3, respectively. S-DSN(sigm) achieves better
results than DSN, LC-KSVD and SRC. From Table 2,
S-DSN(sigm)-1 is better than LC-KSVD, with about 0.2%
improvement on Extended YaleB. From Table 3, S-DSN(sigm)-1
and S-DSN(sigm)-2 are also better than LC-KSVD, with about
0.1% and 0.3% improvement on AR, respectively. In addition,
we compare with LC-KSVD in terms of the computation time for
classifying one test image. S-DSN has faster inference
because it learns projection dictionaries directly. As shown
in Table 4, S-DSN is about 7 times faster than LC-KSVD.

15 Scene Category: Following the common experimental
settings, we randomly choose 100 images from each class as
training data and use the rest as test data. In our
experiments, $\epsilon = 20$ and $\alpha = 0.1$ are used in
DSN; $\epsilon = 20$, $\alpha = 0.1$, $G = 4$, and
$\beta = 0.05$ are used in S-DSN(sigm); $\epsilon = 15$,
$\alpha = 0.1$, $G = 4$, and $\beta = 0.001$ are used in
S-DSN(relu).

We compare our results with SRC (Wright et al. 2009),


Table 4: Inference Time (ms) for a Test Image on the
Extended YaleB Database

Methods        SRC      LC-KSVD   S-DSN(relu)
Average time   20.121   0.502     0.069





























































Table 5: Recognition Results Using Spatial Pyramid Features
on the 15 Scene Category Database

Methods         Acc. (%)   Methods         Acc. (%)
ITDL            81.1       ISPR+IFV        91.1
SR-LSR          85.7       ScSPM           80.3
LLC             89.2       SRC             91.8
LC-KSVD         92.9       DeepSC          83.8
DeCAF           88.0       DSFL+DeCAF      92.8
DSN-1           96.7       DSN-2           97.0
S-DSN(sigm)-1   96.5       S-DSN(sigm)-2   97.1
S-DSN(relu)-1   98.8       S-DSN(relu)-2   98.8

Table 6: Recognition Results Using Spatial Pyramid Features
on the Caltech101 Database
Figure 2: The confusion matrix on the 15 scene category
database.


LC-KSVD (Jiang, Lin, and Davis 2013), DeepSC (He et al.
2014), DSN (Deng and Yu 2011b) and other state-of-the-art
approaches: ScSPM (Yang et al. 2009), LLC (Wang et al.
2010), ITDL (Qiu, Patel, and Chellappa 2014), ISPR+IFV (Lin
et al. 2014), SR-LSR (Li and Guo 2014), DeCAF (Donahue et
al. 2014) and DSFL+DeCAF (Zuo et al. 2014). The detailed
comparison results are shown in Table 5. Compared to
LC-KSVD, S-DSN(relu)-1 performs much better, with a 5.9%
improvement. It also registers about 1.8% improvement over
the deep models DeepSC, DeCAF, DSFL+DeCAF and DSN. As Table
5 shows, S-DSN(relu) performs best among all compared
methods. The confusion matrix for S-DSN(relu) is shown in
Figure 2, from which we can see that the misclassification
errors of industrial and store are higher than those of the
other categories.

Caltech101: Following the common experimental settings, we
train on 5, 10, 15, 20, 25, and 30 samples per category and
test on the rest. Due to space reasons, we only give the
parameters for 30 training samples per category:
$\epsilon = 0.2$ and $\alpha = 0.5$ are used in DSN;
$\epsilon = 0.2$, $\alpha = 0.5$, $G = 4$, and
$\beta = 0.01$ are used in S-DSN(sigm); $\epsilon = 0.05$,
$\alpha = 0.5$, $G = 2$, and $\beta = 0.001$ are used in
S-DSN(relu).

We evaluate our approach using spatial pyramid features and
compare with SRC (Wright et al. 2009), LC-KSVD (Jiang, Lin,
and Davis 2013), DeepSC (He et al. 2014), DSN (Deng and Yu
2011b) and other approaches: ScSPM (Yang et al. 2009), LLC
(Wang et al. 2010), LRSC (Zhang et al. 2013) and LCLR
(Jiang, Guo, and Peng 2014). The average classification
rates are reported in Table 6. From these results,
S-DSN(relu)-1 outperforms the other competing dictionary
learning approaches, including LC-KSVD, LRSC and SRC, with a
1.6% improvement. S-DSN(relu) also registers about 1.5%
improvement over a deep model, DSN. Note that the 76.5%
accuracy achieved by our method (with 1000 hidden units) is
also competitive with the 78.4% reported for DeepSC.

[Table 6, surviving fragment: 68.7, 73.5, 76.0;
S-DSN(relu)-2: 55.6, 64.2, 69.0]



Figure 3: Effect of the number of hidden units used in
S-DSN(sigm), S-DSN(relu) and DSN on recognition accuracy.

We examine how the performance of the proposed S-DSN
changes when varying the number of hidden units. We randomly
select 30 images per category for training and use the rest
for testing. We consider six settings in which the number of
hidden units ranges from 100 to 3000 and compare the results
with DSN. As shown in Figure 3, our approaches maintain high
classification accuracy and outperform the DSN model.
Accuracy improves for S-DSN(sigm), S-DSN(relu) and DSN as
the number of hidden units increases.

Effects of Number of Layers: The deep framework utilizes
multiple layers of feature abstraction to obtain a better
representation for images. Tables 2, 3, 5 and 6 show the
effect of varying the number of layers: the classification
accuracy generally improves as the number of layers
increases. In addition, compared to the dictionary learning
approaches, S-DSN has faster inference and a deep
architecture, and it performs well in image classification.

Conclusion 

In this paper, we present an improved DSN model, S-DSN, for
image classification. S-DSN is constructed by stacking many
sparse SNNM modules. In each sparse SNNM module, the
upper-layer weights are solved by convex optimization and
the lower-layer weights by a gradient descent algorithm. We
use the S-DSN to further extract sparse representations from
the random-face features and spatial pyramid features for
image classification. Experimental results show that S-DSN
yields very good classification results on four public
databases with only a linear classifier.